## Data Wrangling ##
```r
# 1. Create a version of the data containing baseline risk variables and
#    longitudinal outcome variables for clustering EDA
clustering_eda_data <- full_join(risk_variable_data, outcome_variable_data) %>%
  dplyr::select(-site, -age, -race_ethnicity, -sex, -family_id)

# 2. Display distribution of data
clustering_eda_data %>%
  skimr::skim()
```

| Data summary | Value |
|---|---|
| Name | Piped data |
| Number of rows | 83378 |
| Number of columns | 26 |
| Column type frequency: | |
| character | 2 |
| numeric | 24 |
| Group variables | None |

Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| participant_id | 0 | 1 | 12 | 12 | 0 | 11878 | 0 |
| session_id | 0 | 1 | 7 | 7 | 0 | 9 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mh_p_cbcl__dsm__dep_tscore | 16349 | 0.80 | 54.71 | 6.58 | 50.00 | 50.00 | 51.00 | 57.00 | 96.00 | ▇▂▁▁▁ |
| mh_p_cbcl__dsm__anx_tscore | 15888 | 0.81 | 54.11 | 6.37 | 50.00 | 50.00 | 51.00 | 55.00 | 100.00 | ▇▁▁▁▁ |
| mh_p_cbcl__synd__attn_tscore | 2944 | 0.96 | 53.47 | 5.66 | 50.00 | 50.00 | 51.00 | 55.00 | 100.00 | ▇▁▁▁▁ |
| mh_p_cbcl__synd__aggr_tscore | 2938 | 0.96 | 52.18 | 4.79 | 50.00 | 50.00 | 50.00 | 52.00 | 100.00 | ▇▁▁▁▁ |
| mh_p_gbi_sum | 74185 | 0.11 | 1.22 | 2.65 | 0.00 | 0.00 | 0.00 | 1.00 | 28.00 | ▇▁▁▁▁ |
| mh_y_upps__nurg_sum | 74185 | 0.11 | 8.48 | 2.63 | 4.00 | 7.00 | 8.00 | 10.00 | 16.00 | ▆▇▇▂▁ |
| mh_y_upps__purg_sum | 74185 | 0.11 | 7.94 | 2.93 | 4.00 | 6.00 | 8.00 | 10.00 | 16.00 | ▇▆▆▂▁ |
| le_l_coi__addr1__coi__total__national_zscore | 74185 | 0.11 | 0.01 | 0.03 | -0.11 | -0.01 | 0.02 | 0.04 | 0.08 | ▁▂▅▇▅ |
| fc_p_nsc__ns_mean | 74185 | 0.11 | 3.92 | 0.95 | 1.00 | 3.33 | 4.00 | 4.67 | 5.00 | ▁▁▃▅▇ |
| sds_total | 74185 | 0.11 | 36.31 | 8.05 | 26.00 | 31.00 | 34.00 | 40.00 | 126.00 | ▇▁▁▁▁ |
| family_history_depression | 74185 | 0.11 | 0.31 | 0.46 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| family_history_mania | 74185 | 0.11 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| bullying | 74185 | 0.11 | 0.25 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| nc_y_nihtb__lswmt__uncor_score | 74185 | 0.11 | 97.04 | 11.95 | 36.00 | 90.00 | 97.00 | 105.00 | 136.00 | ▁▁▅▇▁ |
| nc_y_nihtb__flnkr__uncor_score | 74185 | 0.11 | 94.19 | 9.02 | 51.00 | 90.00 | 96.00 | 100.00 | 116.00 | ▁▁▃▇▂ |
| nc_y_nihtb__pttcp__uncor_score | 74185 | 0.11 | 88.36 | 14.48 | 30.00 | 80.00 | 88.00 | 99.00 | 140.00 | ▁▂▇▅▁ |
| ACE_index_sum_score | 74185 | 0.11 | 1.95 | 1.35 | 0.00 | 1.00 | 2.00 | 3.00 | 7.00 | ▇▅▆▁▁ |
| si_passive | 1839 | 0.98 | 0.09 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| si_active | 1839 | 0.98 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| sa | 1839 | 0.98 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nssi | 1839 | 0.98 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| bipolar_I | 40868 | 0.51 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| bipolar_II | 40868 | 0.51 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| any_bsd | 40868 | 0.51 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
Before employing clustering methods to identify meaningful latent groupings related to bipolar disorder and suicidality outcomes, it is critical to visually assess whether baseline risk variables demonstrate meaningful Euclidean separability. In other words, we must verify whether the baseline risk variables, both continuous (e.g., CBCL DSM-5 depression scores) and binary (e.g., family history of depression), form distinct, visually identifiable groups when plotted in pairs and colored according to longitudinal outcomes across assessment timepoints (baseline through 6-year follow-up).
This initial visual evaluation is foundational, as it determines whether clustering methods, which rely heavily on Euclidean or similar distance measures, are suitable for capturing latent risk groups that meaningfully predict clinical outcomes.
We will thus generate biplots of the five baseline risk variables most strongly associated (assessed via point-biserial correlation) with each outcome variable (i.e., binary diagnostic outcomes and continuous CBCL scores), with points colored by that outcome, at every assessment timepoint for which the outcome is available.
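The variable-selection step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own code: the data frame `features`, the vector `outcome`, and the column names `f1` to `f6` are hypothetical stand-ins, and only base R is used.

```r
# Synthetic stand-in data (hypothetical names and effect sizes, not the study data)
set.seed(1)
n <- 500
outcome <- rbinom(n, 1, 0.2)              # binary outcome, e.g. a diagnosis flag
features <- data.frame(
  f1 = rnorm(n) + 1.0 * outcome,          # strongly related to the outcome
  f2 = rnorm(n) + 0.6 * outcome,
  f3 = rnorm(n) + 0.3 * outcome,
  f4 = rnorm(n),                          # unrelated noise features
  f5 = rnorm(n),
  f6 = rnorm(n)
)

# The point-biserial correlation is the Pearson correlation between a
# numeric variable and a 0/1 variable, so cor() computes it directly.
r_pb <- sapply(features, function(x) cor(x, outcome, use = "complete.obs"))

# Keep the five features with the largest absolute correlation.
top5 <- names(sort(abs(r_pb), decreasing = TRUE))[1:5]
print(round(r_pb[top5], 2))
```

With the selection in hand, a call such as `pairs(features[top5], col = ifelse(outcome == 1, "red", "blue"))` would draw the pairwise scatterplots colored by outcome.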
Before committing to any clustering solution, we want to assess further (beyond the biplots) whether our 16 baseline risk variables can meaningfully distinguish outcome groups in Euclidean space. Two questions from Matt Sullivan guide this process:
- Univariate separability: Does a single feature (e.g., Sleep Disturbance) alone already separate “No” from “Yes” for the diagnoses of interest?
- Feature ranking: Which variables show the strongest separation and warrant emphasis?
Goal
Demonstrate on a concrete example (baseline Sleep Disturbance vs. bipolar_I at the 6-year follow-up) whether a single feature can already separate the outcome groups. If it can:
- A formal test confirms that the group means differ
- The two modes become the natural 1-D cluster centroids

Why bipolar_I @ 6Y?
I picked this wave and outcome because bipolar_I at year 6 is our longest-term key clinical endpoint with good sample size and data quality. “Success” here means that Sleep Disturbance has predictive value for future mania onset, which is exactly the kind of univariate signal clustering would leverage.
Steps & Rationale
- Shapiro–Wilk in each group → if both p > .05 use Welch t-test; else Mann–Whitney U
- Cohen’s d quantifies effect size (how far apart the “No” vs “Yes” means really are)
- Fit Gaussian mixture models (GMMs) with 1, 2, 3 components on the pooled Sleep Disturbance scores
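- ΔBIC + likelihood-ratio test tell us whether a second component is statistically justified (the LRT p-value is approximate, because the usual chi-square reference distribution does not strictly hold for mixture models)
- The two-component GMM’s means lie close to where a 1-D k-means (k = 2) algorithm would place its centroids; the match is approximate and is best when the components are well separated with similar variances and weights
- In 1-D, k-means puts its decision boundary at the midpoint between the two centroids; the GMM boundary matches that midpoint only when the components have equal variances and weights
- Thus, if the two modes line up with the “No” and “Yes” groups, Sleep Disturbance already behaves as a strong univariate classifier; if they do not, the bimodality reflects something other than the outcome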
Note that the same workflow (Shapiro → test → Cohen’s d → GMM → 1-D centroids) can be applied to any other risk feature and any other follow-up outcome or timepoint.
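The first two steps of the workflow (normality check, choice of test, effect size) can be sketched in base R as follows. This is an illustration under stated assumptions rather than the code behind the reported results: the helper names `cohens_d`, `shapiro_p`, and `compare_groups` are hypothetical, the scores are synthetic, and groups larger than 5000 are subsampled because `shapiro.test()` accepts at most 5000 values.

```r
# Pooled-SD Cohen's d, signed as mean(yes) - mean(no)
cohens_d <- function(no, yes) {
  n0 <- length(no); n1 <- length(yes)
  pooled_sd <- sqrt(((n0 - 1) * var(no) + (n1 - 1) * var(yes)) / (n0 + n1 - 2))
  (mean(yes) - mean(no)) / pooled_sd
}

# Shapiro-Wilk p-value; shapiro.test() accepts 3 to 5000 values,
# so larger groups are randomly subsampled (an assumption of this sketch).
shapiro_p <- function(v) {
  if (length(v) > 5000) v <- sample(v, 5000)
  shapiro.test(v)$p.value
}

compare_groups <- function(score, group) {
  keep <- !is.na(score) & !is.na(group)
  no  <- score[keep][group[keep] == 0]
  yes <- score[keep][group[keep] == 1]
  both_normal <- shapiro_p(no) > 0.05 && shapiro_p(yes) > 0.05
  if (both_normal) {
    test_name <- "Welch t-test"
    p_value <- t.test(yes, no)$p.value            # var.equal = FALSE by default
  } else {
    test_name <- "Mann-Whitney U"
    p_value <- wilcox.test(yes, no, exact = FALSE)$p.value
  }
  list(test = test_name, p_value = p_value, d = cohens_d(no, yes))
}

# Synthetic right-skewed scores (hypothetical, not the study data)
set.seed(2)
group <- rep(c(0, 1), times = c(400, 100))
score <- 26 + rgamma(500, shape = 2, scale = 5) + 3 * group
result <- compare_groups(score, group)
print(result)
```

Because the synthetic scores are strongly right-skewed, the Shapiro–Wilk check fails and the sketch falls through to the Mann–Whitney U branch, mirroring the path taken for Sleep Disturbance.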
- Fit GMM with 1 component (null model)
- Fit GMM with 2 components (alternative model)
- Fit GMM with 3 components (for BIC comparison)
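The three fits listed above can be reproduced in outline with a small expectation-maximization routine. The sketch below is a base-R illustration on synthetic bimodal data, not the fitting code used for the reported numbers; `component_dens` and `fit_gmm_1d` are hypothetical helpers, BIC is written in the larger-is-better form (an assumption), and the chi-square p-value for the likelihood-ratio test is only approximate for mixtures.

```r
# Mixture density matrix: column j holds w[j] * N(x | mu[j], sigma[j])
component_dens <- function(x, w, mu, sigma) {
  matrix(sapply(seq_along(w), function(j) w[j] * dnorm(x, mu[j], sigma[j])),
         nrow = length(x))
}

# Plain EM for a 1-D Gaussian mixture with k components
fit_gmm_1d <- function(x, k, iters = 300) {
  n <- length(x)
  mu <- as.numeric(quantile(x, probs = (seq_len(k) - 0.5) / k))  # spread-out start
  sigma <- rep(sd(x), k)
  w <- rep(1 / k, k)
  for (i in seq_len(iters)) {
    dens <- component_dens(x, w, mu, sigma)
    resp <- dens / rowSums(dens)                 # E-step: responsibilities
    nk <- colSums(resp)                          # M-step: update parameters
    w <- nk / n
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
    sigma <- pmax(sigma, 1e-3)                   # guard against collapsing components
  }
  loglik <- sum(log(rowSums(component_dens(x, w, mu, sigma))))
  n_par <- 3 * k - 1                             # k means, k sds, k - 1 weights
  list(mu = mu, sigma = sigma, weights = w, loglik = loglik,
       bic = 2 * loglik - n_par * log(n))        # larger-is-better convention
}

# Synthetic bimodal scores (hypothetical, not the study data)
set.seed(3)
x <- c(rnorm(700, 30, 3), rnorm(300, 45, 5))

fits <- lapply(1:3, function(k) fit_gmm_1d(x, k))
bic <- sapply(fits, function(f) f$bic)
delta_bic_21 <- bic[2] - bic[1]                  # positive values favor 2 components

# Likelihood-ratio test, 1 vs 2 components (3 extra parameters; approximate)
lrt_stat <- 2 * (fits[[2]]$loglik - fits[[1]]$loglik)
lrt_p <- pchisq(lrt_stat, df = 3, lower.tail = FALSE)

print(round(c(delta_bic_21 = delta_bic_21, lrt_stat = lrt_stat), 1))
print(round(sort(fits[[2]]$mu), 1))              # two-component means
```

A dedicated package such as mclust would normally be preferable for real analyses; the hand-written EM is used here only to keep the example self-contained.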
| Statistic | Value | Comment |
|---|---|---|
| Mann–Whitney U | 316093.5 | p < .001: groups differ |
| Cohen’s d | 0.3 | small to moderate |
| ΔBIC (2−1) | 2381.1 | strong support for 2 modes |
| LRT p-value | ≈ 0 | prefer 2-component model |
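
For reference, the two model-comparison statistics in the table can be written out as follows. This is a sketch under assumptions the source does not state: BIC is taken in the larger-is-better form, which is consistent with a positive ΔBIC (2−1) favoring two components, and the parameter count is that of an unconstrained 1-D Gaussian mixture (k means, k variances, k − 1 weights).

$$
\begin{aligned}
\mathrm{BIC}_k &= 2\,\ell_k - p_k \ln n, \qquad p_k = 3k - 1,\\
\Delta\mathrm{BIC}_{(2-1)} &= \mathrm{BIC}_2 - \mathrm{BIC}_1 = 2(\ell_2 - \ell_1) - 3\ln n,\\
\Lambda &= 2(\ell_2 - \ell_1).
\end{aligned}
$$

Here $\ell_k$ is the maximized log-likelihood of the k-component mixture and $n$ is the number of pooled scores. $\Lambda$ is often compared with a chi-square distribution on $p_2 - p_1 = 3$ degrees of freedom, but that reference distribution is only approximate for mixture models, so the LRT p-value is best read alongside ΔBIC rather than on its own.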
Plot breakdown: Youth who develop BD-I by the 6-year follow-up (“Yes,” red) show a modest rightward shift in baseline Sleep Disturbance relative to those who do not (“No,” blue). A Mann–Whitney U test confirms that the groups differ (p < .001), with Cohen’s d ≈ 0.3. A two-component Gaussian mixture fitted to the pooled scores places its component means (dashed lines) roughly where 1-D k-means centroids would fall, which gives an intuitive univariate clustering boundary. With d ≈ 0.3, however, the “Yes” and “No” distributions still overlap heavily on either side of that boundary.
Answer to Q1: Sleep Disturbance Univariate Separability
- Formal group-difference test:
  - Mann–Whitney U p < .001 confirms that the “Yes” and “No” groups differ on baseline Sleep Disturbance
  - Cohen’s d ≈ 0.3 indicates a small-to-moderate mean shift, matching the partial overlap in the histograms
- Normal vs. two-normal comparison:
  - A two-component Gaussian mixture is strongly preferred (ΔBIC ≈ 2381; LRT p ≈ 0), so a two-component description of the pooled scores is statistically supported
  - Its component means (dashed lines) lie approximately where a 1-D k-means (k = 2) would place its centroids
- Clustering interpretation in 1-D:
  - In one dimension, k-means centroids sit near the two component means, and cases are classified by the midpoint between them
  - Sleep Disturbance alone therefore yields a natural 2-cluster solution, but with d ≈ 0.3 that split discriminates only modestly between “Yes” and “No”; it is far from a perfect univariate classifier
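
To make the 1-D clustering interpretation concrete, the sketch below runs k-means (k = 2) on synthetic scores whose two groups differ by 0.3 standard deviations by construction, then checks how well the midpoint boundary recovers the groups. The data, the 90/10 group split, and all variable names are hypothetical and chosen only for illustration.

```r
# Synthetic scores with a built-in group shift of 0.3 SD (not the study data)
set.seed(42)
group <- rep(c(0, 1), times = c(900, 100))             # 0 = "No", 1 = "Yes"
score <- rnorm(1000, mean = 36 + 2.4 * group, sd = 8)  # 2.4 / 8 = 0.3 SD shift

# 1-D k-means with two clusters
km <- kmeans(score, centers = 2, nstart = 10, iter.max = 50)
centroids <- sort(as.numeric(km$centers))
boundary <- mean(centroids)                            # midpoint decision boundary

# Treat the upper cluster as the predicted "Yes" group
predicted <- as.integer(score > boundary)
sensitivity <- mean(predicted[group == 1] == 1)
specificity <- mean(predicted[group == 0] == 0)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(boundary = boundary,
              sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 2))
```

The clustering always returns two centroids and a boundary, so the informative quantity is how closely the resulting split tracks the outcome, which for a shift of this size is only modestly better than chance.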
Goal
Identify which baseline risk features show at least medium univariate association with our key outcome (here, bipolar_I at 6 years), so we know which variables carry the strongest marginal signal. We still retain all 16 features in the master set—this ranking only tells us which deserve a closer look first.
Why bipolar_I @ 6 Y?
This is our longest‐term clinical endpoint with the largest sample and best data completeness. A variable that shows even a medium effect here is a strong candidate for driving cluster structure.
Steps & Rationale
- Compute Cohen’s d (“No” vs “Yes”) for each baseline risk feature
- Flag |d| > 0.5 as a medium effect and |d| > 0.8 as a large effect
- Rank all features by |d|
- Short-list those with |d| > 0.5 for further 2-D checks (but keep the full set for clustering)

Note: the same pattern (compute |d| for binary outcomes, η² for CBCL quintiles) can be applied to any other outcome, including continuous outcomes converted into quintile groups.
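
The ranking steps above can be sketched in base R as follows. This is an illustration on synthetic data, not the code that produced the table: `cohens_d`, `label_effect`, and the columns `feat_a` to `feat_c` are hypothetical, and the sign of d is taken as mean(“No”) minus mean(“Yes”), which is an assumption (the ranking itself depends only on |d|).

```r
# Cohen's d with pooled SD, signed as mean("No") - mean("Yes") (assumed convention)
cohens_d <- function(x, g) {
  no  <- x[which(g == 0 & !is.na(x))]
  yes <- x[which(g == 1 & !is.na(x))]
  n0 <- length(no); n1 <- length(yes)
  pooled_sd <- sqrt(((n0 - 1) * var(no) + (n1 - 1) * var(yes)) / (n0 + n1 - 2))
  (mean(no) - mean(yes)) / pooled_sd
}

# Thresholds from the steps above: |d| > 0.5 medium, |d| > 0.8 large
label_effect <- function(abs_d) {
  as.character(cut(abs_d, breaks = c(-Inf, 0.5, 0.8, Inf),
                   labels = c("small", "medium", "large")))
}

# Synthetic stand-in data (hypothetical, not the study data)
set.seed(7)
n <- 2000
outcome <- rbinom(n, 1, 0.1)
risk <- data.frame(
  feat_a = rnorm(n) + 0.8 * outcome,
  feat_b = rnorm(n) + 0.3 * outcome,
  feat_c = rnorm(n)
)

# Compute d per feature, then rank by |d| and attach effect-size labels
d <- sapply(risk, cohens_d, g = outcome)
ranking <- data.frame(variable = names(d), d = unname(d), abs_d = abs(unname(d)))
ranking$effect <- label_effect(ranking$abs_d)
ranking <- ranking[order(-ranking$abs_d), ]

# Short-list for follow-up 2-D checks; the full set is still kept for clustering
shortlist <- ranking$variable[ranking$abs_d > 0.5]
print(ranking)
```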
| Risk Variable | Cohen’s d | Abs. Value d | Effect Size |
|---|---|---|---|
| CBCL Depression T | -0.52 | 0.52 | medium |
| ACE Index | -0.39 | 0.39 | small |
| Sleep Disturbance | -0.30 | 0.30 | small |
| CBCL Anxiety T | -0.29 | 0.29 | small |
| CBCL Attention T | -0.26 | 0.26 | small |
| Fam Hx Dep | -0.24 | 0.24 | small |
| CBCL Aggression T | -0.23 | 0.23 | small |
| GBI Mania Score | -0.22 | 0.22 | small |
| UPPS Neg Urgency | -0.22 | 0.22 | small |
| UPPS Pos Urgency | -0.16 | 0.16 | small |
| Bullying | -0.15 | 0.15 | small |
| NIHTB Flanker | -0.14 | 0.14 | small |
| NIHTB Working Mem | -0.09 | 0.09 | small |
| NIHTB Proc Speed | -0.07 | 0.07 | small |
| Fam Hx Mania | -0.07 | 0.07 | small |
| Child Opportunity Z | -0.02 | 0.02 | small |
| Neighborhood Safety | -0.02 | 0.02 | small |
Only baseline CBCL Depression T shows a medium univariate effect (|d| = 0.52) for future Bipolar I at the 6-year follow-up; every other baseline risk factor falls in the small range (|d| from 0.39 down to 0.02), with ACE Index (|d| = 0.39) and Sleep Disturbance (|d| = 0.30) the next strongest.
This tells me two things:
- No single predictor can strongly separate “No” vs “Yes” cases on its own; each variable carries at most a modest signal
- A multivariate approach such as clustering in the full 16-dimensional risk space can likely (I think) harness the combined power of these small-effect variables (and their interactions) to form more discriminative risk groupings than any one variable alone, and the biplots above suggest that this may be possible to some degree